Joint Learning of Character and Word Embeddings
Authors
Abstract
Most word embedding methods take a word as a basic unit and learn embeddings according to words’ external contexts, ignoring the internal structures of words. However, in some languages such as Chinese, a word is usually composed of several characters and contains rich internal information. The semantic meaning of a word is also related to the meanings of its composing characters. Hence, we take Chinese as an example and present a character-enhanced word embedding model (CWE). In order to address the issues of character ambiguity and non-compositional words, we propose multiple-prototype character embeddings and an effective word selection method. We evaluate the effectiveness of CWE on word relatedness computation and analogical reasoning. The results show that CWE outperforms other baseline methods which ignore internal character information. The code and data can be accessed from https://github.com/Leonard-Xu/CWE.
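The core composition the abstract describes can be sketched as follows: a word's embedding is combined with the mean of its characters' embeddings, and non-compositional words fall back to the plain word vector. This is a minimal illustration with made-up toy vectors; the variable names, dimensionality, and the 0.5 weighting are assumptions for the sketch, not taken from the released code.

```python
import numpy as np

# Toy lookup tables (illustrative only): word vectors and character vectors.
DIM = 4
rng = np.random.default_rng(0)
word_vecs = {"智能": rng.normal(size=DIM)}
char_vecs = {"智": rng.normal(size=DIM), "能": rng.normal(size=DIM)}

def cwe_embedding(word: str) -> np.ndarray:
    """CWE-style composition: average the word vector with the mean of
    its characters' vectors; non-compositional words (no character
    entries) keep the plain word vector."""
    w = word_vecs[word]
    chars = [char_vecs[c] for c in word if c in char_vecs]
    if not chars:
        return w
    return 0.5 * (w + np.mean(chars, axis=0))

emb = cwe_embedding("智能")
```

Multiple-prototype character embeddings would extend this sketch by keeping several vectors per character and selecting one per context, which is how the paper addresses character ambiguity.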
Similar papers
Word and Document Embeddings based on Neural Network Approaches
Data representation is a fundamental task in machine learning. The representation of data affects the performance of the whole machine learning system. In a long history, the representation of data is done by feature engineering, and researchers aim at designing better features for specific tasks. Recently, the rapid development of deep learning and representation learning has brought new inspi...
Full text
A Joint Model for Word Embedding and Word Morphology
This paper presents a joint model for performing unsupervised morphological analysis on words, and learning a character-level composition function from morphemes to word embeddings. Our model splits individual words into segments, and weights each segment according to its ability to predict context words. Our morphological analysis is comparable to dedicated morphological analyzers at the task ...
Full text
Effective Word Representation for Named Entity Recognition
Recently, various machine learning models have been built using word-level embeddings and have achieved substantial improvement in NER prediction accuracy. Most NER models only take words as input and ignore character-level information. In this paper, we propose an effective word representation that efficiently includes both the word-level and character-level information by averaging its charac...
Full text
Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities
Entities are essential elements of natural language. In this paper, we present methods for learning multi-level representations of entities on three complementary levels: character (character patterns in entity names extracted, e.g., by neural networks), word (embeddings of words in entity names) and entity (entity embeddings). We investigate state-of-the-art learning methods on each level and f...
Full text
Context-Specific and Multi-Prototype Character Representations
Unsupervised word representations have demonstrated improvements in predictive generalization on various NLP tasks. Much effort has been devoted to effectively learning word embeddings, but little attention has been given to distributed character representations, although such character-level representations could be very useful for a variety of NLP applications in intrinsically “character-base...
Full text